System and method for aggregation and analysis of information

ABSTRACT

The invention provides a system and method for automated data analysis in which data agents are located and operate at each member site or data source (i.e., locally). These agents access stored data at the data source or member sites, process the data and also aggregate the results. The aggregated results from each of the member sites are then forwarded to and further aggregated at a central analytic hub. The central analytic hub contains a centralized application which can further aggregate each of the aggregated results and perform a final analysis. These results are then delivered to the requestor without any ability to identify individual data sources, or records from those sources.

CROSS REFERENCE TO RELATED APPLICATIONS

The present invention is a continuation-in-part of U.S. application Ser. No. 11/493,558 filed Jul. 27, 2006 (herein incorporated by reference in its entirety), which is a continuation of U.S. application Ser. No. 11/271,824 filed Nov. 14, 2005, which is a continuation of U.S. application Ser. No. 11/080,980 filed Mar. 16, 2005, which is a nonprovisional of U.S. Provisional Application No. 60/553,132 filed on Mar. 16, 2004.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a system and method for exchanging, integrating and analyzing information from multiple sources, without risking the divulging of potentially confidential information from any of those sources. Specifically, the present invention relates to the use of a data agent that collects, analyzes and aggregates information related to a data set of interest and forwards the results in a form that can be combined with like data from other sites, and without divulging confidential information contained in the data set.

2. Discussion of the Related Art

The analysis of data generally requires a sufficient set of data points to determine whether results represent real correlations or whether they represent random coincidence. In many industries, there are questions that cannot be answered by any one institution because the size and variation of its dataset is insufficient. Competitors, collaborators, and regulators may have mutual interest in sharing data to provide a joint body of information for answering questions in which they are interested. However, due to competitive, regulatory, or other concerns of trust, institutions may be reluctant to disclose such data, particularly identifying data. Moreover, in other regulated industries, such as healthcare or finance, or in industries where privacy is implied, sharing of certain data is prohibited. Accordingly, a current need exists for a methodology for exchanging, integrating and analyzing information using a technique that can overcome these concerns and prohibitions and provide data of sufficient size and variation with the added benefit of ensuring anonymity of data providers.

Current techniques and systems attempt to address these confidentiality and disclosure problems through the use of various data filters that attempt to forward relevant data while preventing the dissemination of private information, by removing personal identifiers. The data filter may be located at a data source. For example, data may be collected from a hospital using an application that strips patient information from the data records before sending the data records for statistical analysis. Alternatively, other known data stripping utilities operate at a data analysis location, removing confidential information from data acquired from a distant location, either before or after statistical analysis of the acquired data. The problem with these methods is four-fold. First, the anonymization techniques used are often reversible given other external information, or are insufficient to completely anonymize the individual. Second, the data records themselves are no longer under control of the source site, and so could be used inappropriately. Third, to fully anonymize the data may require removal of important fields other than explicit identifiers. This loss of fields or variables may put constraints on the utility of anonymized data in a pooled analysis. Fourth, removing data that might identify an individual might also impede the ability to find and analyze rare events. For meaningful analysis of rare events, which by definition occur infrequently, all data points should be included because sampling techniques are inappropriate and may miscount or otherwise distort the occurrence of the rare events. Not only might the data be removed for de-identification, but the analysis cannot be performed at individual sites and then combined, because rare events will not show up as significant in local analyses.

SUMMARY OF THE PRESENT INVENTION

In response to these and other needs, the present invention provides a system and method for automated data analysis in which data agents are located and operate at each member site or data source (i.e., locally). These agents access stored data at the data source or member sites, process the data and also aggregate the results. The aggregated results from each of the member sites are then forwarded to and further aggregated at a central analytic hub. The central analytic hub contains a centralized application which can further aggregate each of the locally aggregated results and perform further analysis. These results are then delivered to the requestor without any ability to identify individual data sources, or records from those sources. Because the data agents at the member sites only forward aggregate data and not any of the actual records from the member sites, there is virtually no risk of disclosing confidential information contained in the member sites. The data agents are designed to provide the data aggregates needed for any specific request without custom programming. Moreover, because the request is processed through a central analytic hub, the source of the request may remain anonymous to the member sites, depending upon the procedures established at the hub.

In accordance with the invention, a requesting entity can forward a request for analysis to the central analytic hub. That request is translated into requests to each data agent residing at each member site for data aggregates. No reprogramming is required for each request. When the data aggregates have been collected from each agent, final aggregation and analysis is performed at the hub, and the results delivered to the requestor.

Unlike most standard data warehousing techniques, the data collection, analysis and aggregation system of the present invention does not gather individual records for processing and storage at a central site. Instead, the data collection, analysis and aggregation system analyzes individual requests to determine the data needed to fulfill that request and then analyzes and aggregates data at each source site to the maximal extent possible before transferring the data aggregates to a central location to perform final analysis. By aggregating data at source sites, and sending only summary data to the analysis site, no individual records or identities are exposed. The data collection and aggregation system (hub) will conceal the source site identity and requestor information from the summary data. Thus, even the requestor of information will remain anonymous. Consequently, the present invention provides a unique advantage that even mutually distrustful organizations, such as competitors, can participate in such a data exchange, and benefit from industry-wide analyses.

The data collection and aggregation system in accordance with the invention is used in the context of an exchange model, in which a set of exchange members agree to host a service to summarize their data in response to specific requests and provide those summaries to the exchange analysis servers. In return, exchange members are granted the ability to request analyses from the exchange, receive compensation for the provision of information, or are considered compliant for regulatory purposes.

Once a request for analysis is made to the exchange, it is processed to identify (1) the data sources needed, (2) the variables to be analyzed, and (3) the strata (or bins) in which to group each variable. Requests are then sent to the agents located at the member sites for the information needed, along with (1) instructions on what data to collect and how to bin the requested data, and (2) instructions on how to aggregate the data. The aggregates requested will not include variables that carry identifying information.
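The request processing described above can be sketched as a small data structure. This is a minimal illustration only, not the disclosed implementation: the names `Stratum`, `AggregateRequest`, `build_request` and the contents of `IDENTIFYING_VARIABLES` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass

# Hypothetical list of variables that carry identifying information.
# In practice this would come from exchange rules; it is an assumption here.
IDENTIFYING_VARIABLES = {"patient_id", "name", "ssn"}


@dataclass(frozen=True)
class Stratum:
    """One variable to analyze and the bin edges used to group it."""
    variable: str
    edges: tuple


@dataclass(frozen=True)
class AggregateRequest:
    """What a member-site agent is asked to compute."""
    sources: tuple          # (1) data sources needed
    strata: tuple           # (2) variables and (3) their bins
    aggregate: str = "count"  # how to aggregate the binned data


def build_request(sources, binning, aggregate="count"):
    """Build a request, refusing any variable that carries identifying data."""
    forbidden = IDENTIFYING_VARIABLES & set(binning)
    if forbidden:
        raise ValueError(f"identifying variables not allowed: {sorted(forbidden)}")
    strata = tuple(Stratum(var, tuple(edges)) for var, edges in sorted(binning.items()))
    return AggregateRequest(tuple(sources), strata, aggregate)
```

For example, `build_request(["emr"], {"age": [0, 40, 65, 120]})` yields a request with a single age stratum, while including `"patient_id"` among the variables raises an error before anything is sent to a member site.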

When an agent residing at a member site receives the request, it retrieves the data needed for processing the request, codes it where necessary according to predefined exchange rules or standard dictionaries, bins the data as specified, and computes the summary data requested. The agent then sends this data aggregate to a central analysis hub. That hub may be a single central hub, or the exchange may be configured with regional analysis hubs that perform another level of aggregation or analysis before forwarding to a central hub. The collected data aggregates can optionally be retained at the member sites in case follow-up requests are made.
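The agent-side binning and summarizing step can be illustrated with a short sketch. It assumes records are plain dictionaries and that the requested summary is a count per stratum; the function names `bin_label` and `local_aggregate` are hypothetical, and the half-open bin convention is an assumption rather than something the specification states.

```python
from bisect import bisect_right
from collections import Counter


def bin_label(value, edges):
    """Map a value to a half-open bin [lo, hi) defined by sorted edges.

    Values outside the edge range are clamped into the first or last bin
    (an assumption made for this sketch).
    """
    i = bisect_right(edges, value) - 1
    i = max(0, min(i, len(edges) - 2))
    return f"{edges[i]}-{edges[i + 1]}"


def local_aggregate(records, binning, group_by=()):
    """Count records per (group values..., bin labels...) key.

    Only the counts leave this function; no individual record is returned.
    """
    counts = Counter()
    for rec in records:
        key = tuple(rec[g] for g in group_by)
        key += tuple(bin_label(rec[var], edges) for var, edges in binning.items())
        counts[key] += 1
    return dict(counts)
```

With four hypothetical records (two type A patients aged 30 and 35, two type B patients aged 50 and 70) and age edges `[0, 40, 65, 120]`, the aggregate contains three cells: two patients in `("A", "0-40")`, one in `("B", "40-65")` and one in `("B", "65-120")`.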

The results of data analysis are then returned to the initial requestor. If the requestor wishes to drill down on the results, a follow-up request can be made. That secondary request is forwarded to the member sites and may include pooled results from the first round, as well as make use of the saved aggregates from the first request. The follow-up may require drill-down or re-slicing of the data aggregate computed for the original request.

Thus, the present invention provides a system and method for addressing many of the shortcomings of present data collection and aggregation technologies. Specifically, the present invention provides the novel use of data aggregation methods, particularly data aggregation and summary table technology, to protect the privacy of data without losing critical information needed for analysis. While existing data collection and aggregation techniques may build data slices from a common warehouse, the present invention provides the novel use of paired aggregation and disaggregating techniques to combine data from distinct data sources to a common specification, followed by the merger of the data aggregates from the separate data sources.

The present invention differs from known data collection techniques. The present invention does not rely upon taking only a sample of a data set or otherwise reducing data size or data distribution for performance or throughput improvements. Sampling techniques find a valid subset of data points to reduce data size, to improve performance, or to conform to limited resources. In contrast, the present invention provides for meaningful statistical analysis from all data points of interest.

Similarly, the present invention differs from known caching techniques that pull data from a central warehouse and store previous queries or data subsets in caches close to the user. Instead, the present invention takes data aggregates from multiple sources and conveys them through a central site for processing and analysis.

Likewise, the present invention further differs from other known data collection and sampling techniques, such as parallel query techniques that use multiple sites to run queries on slices of the data in parallel to improve throughput; meta-analysis that combines existing results by weighting techniques; and Private Information Retrieval techniques that provide individual data points while protecting identity and source of those data points. The present invention does not provide individual data points, or the potentially confidential information contained in these data points. Instead, the systems and methods of the present invention use intelligence about the mining or analytical techniques to guide data aggregation whereby individual record identifying information is removed while retaining all statistical information relevant to the analysis. In one embodiment, the present invention can be adapted to employ known techniques for removing aggregations below thresholds.
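One common form of removing aggregations below thresholds is small-cell suppression, in which any cell whose count falls below a minimum is withheld. The following is a sketch under that assumption; the specification does not say which thresholding technique is used, and the function name `suppress_small_cells` and the default `min_count=5` are hypothetical.

```python
def suppress_small_cells(aggregate, min_count=5):
    """Drop cells whose count is below min_count.

    Returns the retained cells and the number of cells withheld, so the
    recipient knows suppression occurred without seeing the small counts.
    The threshold value is an illustrative assumption.
    """
    kept = {key: n for key, n in aggregate.items() if n >= min_count}
    withheld = len(aggregate) - len(kept)
    return kept, withheld
```

A threshold of this kind trades some rare-event detail for protection against re-identification from very small cells, so whether and where to apply it would depend on the analysis being requested.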

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete understanding of the present invention and advantages thereof may be acquired by referring to the following description taken in conjunction with the accompanying drawings, in which like reference numbers indicate like features, and wherein:

FIG. 1 illustrates a flowchart depicting steps in a confidential data aggregation method in accordance with embodiments of the present invention;

FIG. 2A is a block diagram of a system for data aggregation using an exchange hub in accordance with an embodiment of the invention;

FIG. 2B is a block diagram illustrating the multi-tiered operation of the exchange hub in accordance with an embodiment of the invention;

FIG. 3 is a block diagram illustrating operation of an exchange member in accordance with an embodiment of the invention; and

FIG. 4 is a block diagram illustrating operation of the exchange hub in greater detail.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As depicted in FIG. 1, the present invention provides a data aggregation method 100 for the confidential and effective collection of data. The data aggregation method 100 starts at step 105 where a request is formulated and sent by a requesting entity. The process then moves to step 110 when at least one formulated data request is received using known techniques. The request is received from a requestor at a hub or central computer. The received data request defines the desired subject matter and scope of the data collection. For example, the data request may ask for aggregated or statistical results related to the use of a specified medical drug to address a specified condition in a specified area over a specified period. As described in greater detail below, the request is generally in an electronic format and received at the central computer.

In the next step, step 120, the hub or central computer processes and/or transforms the request, which can involve reformulating the request so that it can be understood by data sources, and then transmits the processed request to the data sources. Thus, step 120 entails transmitting the processed request as needed to access relevant data in the other locations. Returning to the above example of a request, the central computer may transform a request about a specified medical drug for a specified condition in a specified area over a specified period into a series of Boolean expressions that define the data search. It should be appreciated that the transformation step may be adapted as needed to access the relevant data contained at the remote location. Similarly, the transformed search request may be used to access previous analyses that are stored at the hub or other location.

The process then moves to step 130. In step 130, the other locations, including other member sites, receive and process the data request. This step generally includes the other locations searching for relevant data and then processing this data. In this step, the other locations process and analyze the data to determine certain statistics. Continuing with the previous example, each location may, for example, produce an analysis and/or statistics that address the effectiveness and side-effects of the specified drug for the specified condition in the specified area over the specified period according to the data contained in that remote location. This analysis of the data that resides at each remote location is generally performed using known techniques.

The process then moves to step 140, where the analyzed data are aggregated at each remote location and the aggregated data are then ready to be sent to the hub. It should be noted that the data sent from the remote locations is never the raw information or data, but only the analyzed data, the statistical data or the aggregates. In this way, it is ensured that confidential data is not improperly divulged. Returning to the above example, the identity of patients receiving the specified drug for the specified condition in the specified area over the specified period would not be used or distributed. Even where the search entailed some type of personal data, such as age, race, sex, etc., this data would not be individually available for each patient, but rather, only present as part of the analyzed or statistical results. In other words, the aggregate data would provide analyzed or statistical information on personal data, such as percentages of people falling into certain categories, but no individual patient's data would ever be transmitted or exposed. The aggregate analyzed data is generally derived using known techniques, and may vary according to the techniques used in step 130 to produce the data analysis.

In one embodiment of the invention, the data collection and aggregation steps 130 and 140 may operate in an iterative fashion. Thus, in step 150, a remote location may access and aggregate analyzed and/or statistical data about information contained in secondary locations, and then forward this aggregated analyzed and/or statistical data to the central location or hub where it can be further aggregated with data collected from other remote locations, which may likewise be collecting and analyzing data from secondary locations. In this way, data analysis from multiple locations may be compiled and analyzed in an efficient and confidential manner. This embodiment can be considered multi-tiered as there may be multiple levels of hubs and remote locations.

The process then moves to step 160 where the central location or hub can then aggregate and further analyze the collected analyzed and/or statistical data from each remote (local) location with any analyzed and/or statistical data already residing at the central location or hub, such as previously analyzed data. Thus, the hub is capable of performing additional analysis of the data aggregated from various remote locations. Again, the aggregation is generally performed using known techniques and may vary according to the techniques used in steps 130 and 140 to produce the data statistics. At this point, if the hub determines that additional aggregated data from the remote locations is required in order to fulfill the original request, the process moves to step 165 which returns the process to step 130.

The central location or hub can then forward the aggregate analyzed data to the requestor in step 170 using known means. The requestor may subsequently use these results to modify the initial request and to repeat the steps in data aggregation method 100. For example, the requestor may change the search terms as needed to refine the results or to collect additional analyzed data and/or statistics.

Turning now to FIG. 2A, an aggregation data exchange 200 for implementing the data aggregation method 100 is now described. The following discussion of the aggregation data exchange 200 refers to a system that can handle both non-members 210 and members 220 and 300. The term “member” is used herein to generally refer to locations that agree to share data results with other locations in the aggregation data exchange 200. Similarly, the term “requesting” is used in this discussion to mean a request for specific statistical or analyzed information about data stored at various remote locations in the exchange. Thus, the non-member 210 merely requests data from locations without likewise providing access to its own data, whereas the requesting member 300 both initiates its own requests and can respond to requests from the non-member requestor 210 or from other members 220. The requestor, 210 or 300, forwards a request 201 to a central location, or exchange hub 400. The exchange hub 400 processes the request 201 and forwards the request to the non-requesting members 220.

Each of the non-requesting members 220 receives the request 201 and processes resident data to produce desired analyzed data or statistics based upon the data that resides within each respective non-requesting member 220. This aggregated and analyzed data, which can be referred to as results 204, has been anonymized and is transmitted to the central location or exchange hub 400. The results 204 may include other data as needed for the operation of the exchange 200. For example, the results 204 may include an accounting or bill for the activities of the non-requesting members 220 or information as needed to further process the transmitted results 204. The specific operation of the members 220 and 300, along with the statistical analysis used by the members 220 and 300, is described in greater detail below in FIG. 3 and the associated discussion.

The exchange hub 400 then further aggregates and/or analyzes the results 204 collected from the non-requesting members 220. As described above in the discussion of the data aggregation method 100, the central exchange or hub 400 may combine the results 204 with information already stored at the hub 400, such as data collected in previous searches. These combined results may be referred to as final results 205. The exchange hub 400 then forwards the aggregated and analyzed data, or final results 205, to the requestor 210 or 300. The specific operation of the exchange hub 400 is described in greater detail below in FIG. 4 and the associated discussion.

Where the non-member requestor 210 has initiated the data search, the exchange hub 400 forwards the final results 205 to the non-member requestor 210. The non-member requestor 210 generally includes a user/application 211 that originated the request 201. The non-member requestor 210 further includes an analytic engine 212 that processes and interprets the final results 205, as needed for local use in the form of a report of aggregated data. In one embodiment, the analytic engine 212 may communicate with a billing and credit module (not shown), which may be located at hub 400, in order to further process the request, which may for example involve using known techniques to create an accounting for the data received by the requestor. For example, such a billing and credit module can identify the non-requesting members 220 producing the final results 205 and produce an invoice for the service provided by the non-requesting members 220.

It should also be understood that FIG. 2A supports embodiments of the invention whereby the interaction between the exchange hub 400 and the non-requesting members 220 is iterative in nature. In accordance with these embodiments, when the exchange hub 400 receives aggregated data from the non-requesting members 220, it can analyze and aggregate the received data and determine that more information is needed to fulfill the request. In this case, the exchange hub 400 can direct a further request for aggregated data to each of the non-requesting members 220. This iterative process can continue until the exchange hub 400 determines that it has received data meeting the request.

Referring now to FIG. 2B, a multi-tiered aggregation exchange hub 200′ is presented. In this embodiment, a non-requesting member 220′, which is a data source, may seek additional data from another set of data sources (secondary sites 230). Thus, this embodiment illustrates a multi-tiered data aggregation, analysis and retrieval system. As shown in FIG. 2B, in the aggregation exchange hub 200′, one or more of the non-requesting members 220′ forwards a request 201 to one or more secondary sites 230 that operate in a similar fashion to the non-requesting members 220. These secondary sites 230 then analyze resident data and return analyzed and/or statistical results 203′ to the associated non-requesting member 220′, which then forms aggregated results 204 by combining the results 203′ from the secondary sites 230 with data aggregated and analyzed within the non-requesting member 220′. Thus, in this embodiment, the non-requesting member 220′ is acting both as a hub, because it is gathering data aggregated from other remote locations, and as a non-requesting member through its aggregation and analysis of resident data.

The operation of a requesting exchange member 300 (as shown in FIG. 2A) is described in greater detail in FIG. 3. As shown in FIG. 3, a request 201 is sent from the requesting member 300 to the exchange hub 400 via firewall 380. This request 201 may be based upon a request for data received by the requesting member 300 from a user (not shown). Such a user can access the requesting member 300 via any number of known interfaces. As described in connection with FIGS. 2A and 2B, final results 205 are then transmitted to the requesting member 300. The requesting member 300 includes an analytic engine 320 which is a resident application that houses algorithms for processing data and for translating the request 201. The request 201 is sent to the exchange hub 400 (shown in detail in FIG. 2A) to be fulfilled to produce final results 205, as described above. The analytic engine 320 also processes the request 201 to acquire statistics on locally accessible data.

Thus, the analytic engine 320 is proprietary computer software that can be installed at the data source 220 or requesting member 210/300 and that allows the querying of resident data, the querying of other data sources in the exchange, and the combining of resident data with results from the exchange to provide statistically meaningful results.

Resident data may be contained in a resident data repository, which is shown as a medical records database 330 in FIG. 3. In the illustrated example of FIG. 3, the records database 330 contains, for example, composite medical data collected from a variety of locations such as lab data 340 a, electronic medical record (EMR) 340 b and other sources 340 c. The records database 330 is also capable of accessing other locations using known data interfaces as needed to access the respective locations. For example, standard EMR or site specific interfaces may be used.

The composite medical data stored in the resident data repository 330 may contain both private and non-private data. For the purposes of this discussion, private data generally means data that contains potentially identifying information that should not be publicly released. Private data is often intertwined with non-private data and it is typically difficult to automatically and efficiently separate the two.

The requesting member 300 (and 220) uses the analytic engine 320 to acquire data, both private and non-private, from the resident data repository 330 according to the request 201 to create summary data module 350. Thus, a request 201 may be sent to the medical records database 330 and information 202 may be returned to the analytic engine 320. The analytic engine 320 can then analyze and aggregate the information 202. Thus, the analytic engine 320 performs an analysis of relevant data in the resident data repository 330 to produce summary data 350 that is also analyzed and aggregated. Only summary data is transferred to the exchange and this prevents the transfer of source identifying information. Thus, the summary data module 350 may include data (results 206) from the final results 205 and the information 202 that is further analyzed and aggregated.

The summary data module 350 may include a data construct with two or more logical dimensions containing at least (a) a set of core cells encoding specific data points, (b) a grand total point, (c) a subtotal line for each pair of dimensions, and (d) a subtotal plane for each pair of dimensions. For example, the summary data module 350 may represent a logical data structure having three dimensions: Industry, Department, and Direction. If the Industry dimension includes two values (i.e., Automotive and Telecom), the Direction dimension includes two values (i.e., Send and Receive), and the Department dimension includes three values (i.e., Sales, Development, and Consulting), then the subtotal line may contain values indicating, for example, how many total e-mail messages were sent and received by the sales department, how many were sent and received by the development department, and how many were sent and received by the consulting department.

The number of values for each dimension also results in core cells forming a two-by-two-by-three cube of cells. Each cell contains the data for one particular combination of the values for each dimension. For example, the sender and recipient of a particular message may both belong to the consulting department. The sender and recipient may both also belong to the industry vertical section associated with the telecom industry (i.e., the Telecom vertical). Therefore, when processing such a message, a data integrating component increments the values in the corresponding core cells. As a result, the grand total point, subtotal lines, and subtotal planes would also reflect those incremented values. Each cell therefore contains summary data for one particular subset of documents. This process is not apparent to the information requestor or to the source of the information.
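The cube of core cells with its subtotals and grand total can be sketched as a counter keyed by one value per dimension, where a wildcard stands for "summed over this dimension". This is an illustrative assumption about the layout, not the disclosed data construct; the class `SummaryCube` and the wildcard `ALL` are hypothetical names.

```python
from collections import Counter
from itertools import product

ALL = "*"  # wildcard meaning "summed over every value of this dimension"


class SummaryCube:
    """Counts for core cells plus every subtotal and the grand total."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = Counter()

    def add(self, **values):
        """Increment the core cell and every subtotal that contains it."""
        coords = [values[d] for d in self.dimensions]
        # Each mask chooses, per dimension, the concrete value or the wildcard.
        for mask in product((False, True), repeat=len(coords)):
            key = tuple(ALL if wild else c for wild, c in zip(mask, coords))
            self.cells[key] += 1

    def get(self, **values):
        """Look up a cell; omitted dimensions are read as subtotals."""
        key = tuple(values.get(d, ALL) for d in self.dimensions)
        return self.cells[key]
```

Using the three dimensions from the example, adding a Telecom/Consulting/Send message increments one core cell, the subtotals along each dimension, and the grand total in a single call; `get(department="Consulting")` then returns the number of messages for the consulting department across all industries and directions, and `get()` returns the grand total.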

In alternative embodiments, the summary data module 350 may have more than three dimensions. For example, a summary data module 350 that contains organization-specific document statistics derived from e-mail messages may include all of the dimensions described above, as well as dimensions for counting e-mail messages between each pair of units within the organization.

The analytic engine 320 creates results 206 based upon the final results 205 and the information 202, after the information 202 has first been analyzed and/or aggregated. These results 206 are converted by the analytic engine 320 using known techniques to produce a report of aggregated data 360. The report of aggregated data 360 may vary according to the user and the request 201. It should be noted that the report of aggregated data 360 does not generally contain any information on the data sources 340 a-c, only statistics acquired from these locations. In this way, the locations 340 a-c have no incentive to conceal adverse information.

As discussed earlier, in accordance with embodiments of the invention, the exchange members 220 and 300 may be compensated for providing information. This encourages participation in the exchange 200, especially for cross-industry purposes where the data and analysis are used to answer questions that are not critical to the local organization. For example, the pharmaceutical industry and its regulators could use the exchange 200 to search for incidence of rare adverse events. This search would lead them, not just to their own industry, but to the medical records stored electronically at hospitals and other care providers. While the providers have a general interest in promoting the discovery of adverse events, it is not operationally a priority and is viewed as the responsibility of the pharmaceutical industry and the regulators. By providing income to the providers, the providers are encouraged to allow their data to be used for such purposes, especially since privacy will be maintained and neither individual records nor the data source will ever be released.

The requesting member 300 may further include a firewall 380 that uses known technology to monitor request transmissions 201 and information 202 from the hub 400. The firewall 380 provides security to prevent unauthorized access to information contained in the requesting member 300, or the non-requesting exchange members 220.

FIG. 4 illustrates the operation of the exchange hub 400 in greater detail. FIG. 4 shows that the exchange hub 400 may include an analytic engine 410, a summary data module 420 and a central billing module 430. The analytic engine 410 is proprietary software that runs at the exchange hub 400 and can answer, in a statistically meaningful way, questions that require detailed (individual) data, without giving any access to the individual data items or the data source and, in fact, through the aggregation techniques, hides the source of the data.

Specifically, the analytic engine 410 is a resident application that houses algorithms for processing data statistics and for translating the request 201 as needed for transmission through the aggregation data exchange 200. In operation, the exchange hub 400 can receive a request 201 from a requesting member 300. The analytic engine 410 can process the request 201 and then transmit the processed request 201 to the non-requesting members 220. The analytic engine 410 also processes the request 201 to examine resident data located in a summary data module 420. Aggregated and analyzed results 204 from the non-requesting members 220 are transmitted to the analytic engine 410. The results 204 generally do not come at the same time, but instead in cycles as collected and processed by each of the non-requesting members 220. Thus, the first results 204 aggregated and analyzed by one of the non-requesting members 220 can populate the summary data module 420. As additional results 204 arrive from other non-requesting members 220, the analytic engine 410 updates the summary data module 420 with the new statistics. Final results 205 are produced based upon the results 204 that are further aggregated with additional results coming from the non-requesting members 220 as well as data resident at the exchange hub 400. These final results 205 are transmitted to the requesting member 300.
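The incremental updating of the hub's summary data as member results arrive can be sketched as follows. It assumes each member's results are a mapping from a cell key to a count, and that counts are additive across members; the class name `HubSummary` and its methods are hypothetical.

```python
from collections import Counter


class HubSummary:
    """Running totals at the hub, updated as each member's results arrive."""

    def __init__(self, resident=None):
        # Optionally start from aggregates already stored at the hub.
        self.totals = Counter(resident or {})
        self.contributions = 0

    def merge(self, member_results):
        """Add one member's cell counts to the running totals.

        The member's identity is deliberately not stored with the totals,
        so the merged cells cannot be traced back to a source.
        """
        self.totals.update(member_results)
        self.contributions += 1

    def final_results(self):
        """Return the pooled cell counts for delivery to the requestor."""
        return dict(self.totals)
```

Because only sums are kept, results can be merged in whatever order they arrive. A billing step that credits contributing members would need a separate record of who contributed, kept apart from the pooled totals.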

The analytic engine 410 can also update a central billing module 430 to credit the various non-requesting members 220 contributing results 204, as described above.

It should be noted that while the exchange hub 400 of FIG. 4 is a centralized dedicated device, in multi-tiered embodiments, there may be a centralized exchange hub in communication with multiple “second-tier” hubs which behave as hubs and also behave as requesting members.

EXAMPLES

Some examples of requests that might be made to the exchange 200 are now described. Most of the examples are in the health care and pharmaceuticals space, although the exchange 200 is not limited to that industry. Along with the examples, some details of the steps necessary to service each request are provided.

Example 1

Which of two competing brands of coronary stents (or other medical device) has the lowest complication rate? Typically, no hospital has enough data to answer the question definitively, but combining data from a number of hospitals could address it. Accordingly, this type of question may be addressed using the data collection exchange 200 of the present invention. First, the question needs to be clarified and put in precise enough terms that the solution can be posed as a database query. This is a standard step, not related directly to the aggregation, but is more complicated, because the data can be represented differently at different sites. Types of questions to be defined in this example include:

-   What events are considered complications? (e.g., consider thrombosis only)
-   In what timeframe after implanting the stent must the event occur to be considered related? (e.g., 3 months)
-   Are there other variables to be controlled for? (e.g., severity of illness, age, etc.)

Each of these questions is then translated into a database query as needed to collect data from relevant data sites 220. Thus, in this example, relevant searchable data may include:

-   Total number of patients receiving stents
-   Number receiving type A and number receiving type B
-   Number receiving type A that had complications within 3 months of surgery
-   Number receiving type B that had complications within 3 months of surgery

If a standard EMR is used at the data sources 340, then the data definition is translated into a query against the EMR fields. The percentages for each type of stent could be calculated at each exchange member 220 and sent to the central exchange server 400, along with the total number for each stent type. The information is assembled into exchange format and transmitted to the data analysis tool 410.

If the desired information in this example is not in standard EMR format, then for each site 220, a mapping from local format to the exchange standard format is predefined. That mapping is used to specify the data available for use in requests. The translation to the exchange standard format is performed as part of gathering the data for the request. Note that even if an EMR standard exists, data may not be maintained natively in that standard format, and so a translation of the request to a local request may be needed.
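As a hedged illustration of such a predefined mapping, the sketch below renames local fields to exchange-standard names and lists the fields a site can offer for requests. The field names on both sides and the function names are assumptions made for the example; they do not come from any actual EMR standard.

```python
# Hypothetical per-site mapping from local field names to exchange-standard names.
LOCAL_TO_EXCHANGE = {
    "pt_age": "age",
    "device_cd": "stent_type",
    "thromb_flag": "complication",
}


def to_exchange_format(local_record, mapping=LOCAL_TO_EXCHANGE):
    """Rename mapped fields; fields absent from the mapping are not exposed."""
    return {
        exchange_name: local_record[local_name]
        for local_name, exchange_name in mapping.items()
        if local_name in local_record
    }


def available_fields(mapping=LOCAL_TO_EXCHANGE):
    """Exchange-standard fields this site makes available for use in requests."""
    return sorted(mapping.values())


# The local record number ("mrn") is not in the mapping, so it is left behind.
record = {"pt_age": 67, "device_cd": "A", "thromb_flag": 0, "mrn": "12345"}
print(to_exchange_format(record))
print(available_fields())
```

Keeping the mapping as data rather than code lets each site 220 declare what it exposes without changing the data agent.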

The analytic engine 410 at the exchange server 400 gathers data from all data sources 220, aggregates that data, and computes the complication rates. Results are then returned to the requestor 210/300. The requestor 210/300 may examine the results and decide that more information is needed about the patients to clarify the results. For example, do older patients do better on one type of treatment, even if the overall rates do not differ significantly?
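One way the complication rates might be combined is sketched below, assuming each site sends a per-stent complication rate together with the total number of patients for that stent type. The function name `pooled_rate` and the sample figures are illustrative assumptions, not data from any site.

```python
def pooled_rate(site_summaries):
    """Combine per-site (rate, total) pairs into one overall rate.

    Each site's rate is weighted by its patient total, so the result equals
    total complications divided by total patients across all sites.
    """
    total = sum(n for _, n in site_summaries)
    if total == 0:
        return None  # no patients reported for this stent type
    events = sum(rate * n for rate, n in site_summaries)
    return events / total


# Illustrative (rate, total) pairs as they might arrive from two sites.
type_a = [(0.05, 100), (0.02, 300)]
type_b = [(0.04, 200), (0.03, 200)]
print(pooled_rate(type_a), pooled_rate(type_b))
```

Weighting by the totals matters: a simple average of the site percentages would overstate the influence of small sites.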

The requestor 210/300 can then formulate a follow-up query 201 that makes use of previous results. The follow-up query may then be translated into a data query against the previous result sets and sent to the sites 220 for computation. Work on the follow-up query continues as described above. More details of this type of request are described below in Example 2.

Example 2

The complication rates for two stents: In this case, the requestor 210/300 wishes to know the relative complication rates of two stents, controlled for age and severity of disease. As in the previous example, the first step in the data collection and aggregation is to clarify the question. In this example, the question clarification requires specifying the strata to be used for age and severity. These criteria may differ for different requests and, consequently, may not be part of the common schema. For example, patients might be aggregated into the following age “bins” or categories: 30-40, 41-50, 51-60, 61-70, 71 and above. Similar categories are then provided for severity of initial disease.

According to these criteria, at each data collection site 220, patients are counted by type of stent, age, and severity of disease. This stratification creates numerous subgroups. For example, if there are 5 age categories and three severity categories, then there are 15 subgroups for each of the two stent types, or 30 total categories to be sent to the exchange. If stratification becomes too detailed, it can potentially provide a backdoor to confidential identifying information. As described above, threshold criteria may be specified at local sites to prevent the distribution of statistical data from a limited number of data points.
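The local stratification and threshold check might be sketched as follows. The age bins follow the example above; the threshold value, the record layout, and the function names are hypothetical choices made only for illustration.

```python
from collections import Counter

AGE_BINS = [(30, 40), (41, 50), (51, 60), (61, 70), (71, None)]
MIN_CELL_SIZE = 5  # hypothetical local threshold; each site may set its own


def age_bin(age):
    """Return the label of the age bin containing `age`, or None if outside."""
    for low, high in AGE_BINS:
        if age >= low and (high is None or age <= high):
            return f"{low}-{high}" if high is not None else f"{low}+"
    return None


def stratify(patients, min_cell_size=MIN_CELL_SIZE):
    """Count patients per (stent, age bin, severity); drop undersized cells."""
    counts = Counter()
    for p in patients:
        label = age_bin(p["age"])
        if label is not None:
            counts[(p["stent"], label, p["severity"])] += 1
    # Cells below the threshold are suppressed before anything leaves the site.
    return {key: n for key, n in counts.items() if n >= min_cell_size}


patients = (
    [{"stent": "A", "age": 65, "severity": "moderate"}] * 6
    + [{"stent": "B", "age": 45, "severity": "severe"}] * 2
)
print(stratify(patients))
```

In this sample, the two-patient subgroup is withheld because it falls below the assumed threshold, while the six-patient subgroup is reported.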

At the exchange server 400, each of the subgroups is summed with the corresponding subgroups from every other site. Evaluations can be performed to determine whether results are different in different subgroups, and whether there is sufficient data in each subgroup for the results to be valid. If needed for analysis, subgroups can be combined. For example, evaluation can be done on severity of disease versus type of stent, without regard to the patients' ages. These are standard calculations, but it should be appreciated that the individual records are hidden at source site 220 and never revealed to the requestor 210/300. Preferably, the exchange server 400 combines the data subgroups as they arrive, and keeps no site-identifying information. Thus, in addition to confidential patient information, other confidential information is protected as well, such as physician information, hospital information, exact dates of treatment, etc.
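A minimal sketch of these two hub-side steps, summing corresponding subgroups across sites and then combining subgroups by ignoring age, is given below. It assumes each site reports counts keyed by (stent type, age bin, severity); the function names and sample counts are illustrative only.

```python
from collections import Counter


def sum_sites(site_counts):
    """Sum each subgroup with the corresponding subgroup from every other site."""
    total = Counter()
    for counts in site_counts:
        total.update(counts)  # adds counts; no site label is retained
    return total


def collapse_age(counts):
    """Combine subgroups across age bins, leaving (stent, severity) cells."""
    combined = Counter()
    for (stent, _age_bin, severity), n in counts.items():
        combined[(stent, severity)] += n
    return combined


site_1 = {("A", "61-70", "mild"): 10, ("A", "71+", "mild"): 7}
site_2 = {("A", "61-70", "mild"): 5, ("B", "61-70", "mild"): 9}

pooled = sum_sites([site_1, site_2])
print(dict(collapse_age(pooled)))
```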

Other Examples

The present invention may be used to phrase questions in the form of predictions. For instance, cardiologists might be interested in being able to predict which patients are likely to suffer strokes after bypass surgery, and known neural net algorithms might be used to provide such predictions. The execution of those algorithms may be divided between member sites and the hub to provide maximum distance between the individual record data and the aggregated data before sending to the hub. The present invention model is fully compatible with this and other types of known data mining techniques.

The present invention has further application in site-specific analysis or benchmarking. For instance, the present invention could be used to determine whether a medical treatment outcome correlates with the number of surgeries performed annually at that site or with the number of surgeries performed by a surgeon. In these types of cases, the data aggregation system 200 does not want to identify the site or the surgeon, so the solution is to associate the site or surgeon's performance numbers with the patients before aggregating at a data source 220. Typically, the data source 220 will have only one bin in the aggregation, but other data, such as the surgeons' numbers, can be combined into this bin. Then those numbers can be aggregated again at the exchange. In this way, the only information transmitted from the data source 220 is a number of performed procedures and not patients' identities.

In another application of the present invention, complication rates from bypass surgery (or other medical procedures) for different hospitals could be prepared. This query requires some sort of site identifier for comparison. The exchange server 400 can provide a random number to be used as an exchange member 220 (or hospital) identifier in sending out the query. The exchange member 220 can retain that identifier and associate the identifier with the query, but the exchange server 400 retains no record of the association. The exchange server 400 compiles the results and returns the fully aggregated data, not broken down by site, to the requestor 210/300. The exchange server at the requestor 210/300 then compares the results to the global results, and so no comparisons are done or retained at the exchange server 400. Of course, the categorization of similar data source exchange members 220 should provide large enough groupings that no single hospital (exchange member 220) is alone in a group.
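The random-identifier scheme and the group-size safeguard might be sketched as follows. This is an assumption-laden illustration: the function names, the message layout, the member addresses, and the minimum group size are hypothetical, and Python's standard `secrets` module is used simply as one convenient source of random identifiers.

```python
import secrets


def dispatch_with_random_ids(member_addresses, query):
    """Pair each outgoing query with a fresh random site identifier.

    Pairs are yielded for sending and are not stored by this function, so the
    sender keeps no record associating an identifier with a member.
    """
    for address in member_addresses:
        yield address, {"query": query, "site_id": secrets.token_hex(8)}


def groups_large_enough(group_sizes, minimum=2):
    """True if no comparison group contains fewer than `minimum` members."""
    return all(size >= minimum for size in group_sizes.values())


# Collected into a list here only so the demonstration can inspect the output.
outgoing = list(
    dispatch_with_random_ids(["member-1", "member-2"], "bypass complication rate")
)
print(groups_large_enough({"urban": 4, "rural": 1}))
```

In the sample call, the "rural" group holds a single member, so the check fails and that grouping would need to be widened before results are released.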

The foregoing description of the preferred embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto. For example, embodiments of the present invention may employ known statistical methods not based on data cubes to collect and aggregate relevant data. Thus, many embodiments of the invention can be made without departing from the spirit and scope of the invention.

1. A data collection method, comprising: a. a computer receiving a data request from a requestor; b. the computer formatting the request and forwarding the formatted request to at least one data location; the at least one data location retrieving data comprising raw information, confidential information, or personal data responsive to the formatted request; c. using the data to generate summary data that does not contain the raw information, confidential information, or personal data; and transmitting the summary data to the computer; the computer aggregating and analyzing the summary data to ensure that the requestor will have no ability to identify said data or data location; and d. the computer forwarding the aggregated and analyzed summary data to the requestor.
 2. The method according to claim 1, wherein the step of creating summary data includes the step of analyzing and aggregating data resident at the at least one data location.
 3. The method according to claim 1, wherein the requestor is at least one of a requesting member and a requesting non-member.
 4. The method according to claim 1, wherein the at least one data location is a non-requesting member.
 5. The method according to claim 1, wherein data responsive to the formatted request includes at least one of private and non-private data.
 6. The method according to claim 1, wherein the data responsive to the formatted request may include at least one of individual records and group records.
 7. The method according to claim 1, wherein the step of creating summary data includes removing all source identifier information.
 8. The method according to claim 1, wherein the aggregated and analyzed summary data does not include individual records, private records and source identifiers.
 9. The method according to claim 1, wherein the step of the at least one data location retrieving data includes: a. forwarding the formatted request to at least one secondary data location; b. creating summary data at the at least one secondary data location; and c. transmitting the summary data to the at least one data location.
 10. The method according to claim 1, wherein the step of the at least one data location retrieving data responsive to the formatted request includes multi-tiered data retrieval, analysis and aggregation.
 11. The method according to claim 1, further including the step of generating a bill for the requesting non-member based upon the requesting non-member's data request.
 12. A system for data collection and processing, comprising: a. an exchange hub; b. a requesting entity communicatively coupled to the exchange hub; c. at least one data location communicatively coupled to the exchange hub; wherein the exchange hub receives a request for information from the requesting entity, processes the received request, forwards the received request to the at least one data location, aggregates all data responsive to the request from the at least one data location so as to create an aggregated data result to ensure that the requestor will have no ability to identify said data or data location; and transmits the aggregated data result to the requesting entity; and d. wherein the at least one data location retrieves data comprising raw information, confidential information, or personal data responsive to the received request; aggregates the data to generate summary data that does not contain the raw information, confidential information, or personal data; and transmits the summary data to the exchange hub.
 13. The system according to claim 12, further including a data agent located at the at least one data location for retrieving data responsive to the request.
 14. The system according to claim 12, wherein the requesting entity is at least one of a requesting member and a requesting non-member.
 15. The system according to claim 12, wherein the exchange hub aggregates and analyzes information received from the at least one data location and data resident to the exchange hub.
 16. The system according to claim 12, wherein the aggregated and analyzed information does not include any source identifier information.
 17. The system according to claim 12, wherein the at least one data location stores at least one of private and non-private data.
 18. The system according to claim 12, wherein the exchange hub may send a further request for information based upon data received in response to the request for information.
 19. The system according to claim 12, wherein the at least one data location includes a multi-tiered group of secondary data locations containing responsive data.
 20. The system according to claim 12, wherein the exchange hub includes an analytic engine for processing requests for information received from a requesting entity, receiving data responsive to the requests for information, aggregating and analyzing the received data and forwarding the aggregated and analyzed data to the requesting entity.
 21. The system in accordance with claim 12, wherein the exchange hub includes a summary data module for storing the results of previous requests for aggregated and analyzed data.
 22. The system according to claim 12, wherein the exchange hub includes a central billing module for generating a billing statement for a requesting non-member.
 23. A multi-tiered process for directing computer information comprising the steps of: a. instructing a second computer to: i. receive information from a first computer; ii. format the information into a request; said request not containing identification information about the first computer in the request; iii. and send the request to a decentralized set of third computers; b. instructing one of the third computers to: i. process the request from the second computer; ii. look-up information in response to the request from a plurality of records containing raw information; iii. generate a reply to the request by aggregating the information from the records which includes analyzed data but does not include the raw information, and iv. send the reply to the second computer; c. instructing the second computer to: i. receive a plurality of replies from the set of third computers; ii. aggregate the replies to generate summary data which does not contain the raw information; and iii. send the summary data to the first computer.
 24. The process of claim 23 wherein the raw information includes individual records, private records, or source identifier information.
 25. A data collection method, comprising: a. a requestor generating a data request; b. a computer receiving the data request from the requestor; c. the computer formatting the request and forwarding the formatted request to at least one data location; d. the at least one data location retrieving data responsive to the formatted request, creating summary data based upon the data and transmitting the summary data to the computer; wherein the summary data does not include individual records, private records, and source identifiers; e. the computer aggregating and analyzing the summary data; f. the computer forwarding the aggregated and analyzed summary data to the requestor.
 26. The method according to claim 25, wherein the step of creating summary data includes the step of analyzing and aggregating data resident at the at least one data location.
 27. The method according to claim 25, wherein the requestor is at least one of a requesting member and a requesting non-member.
 28. The method according to claim 25, wherein the at least one data location is a non-requesting member.
 29. The method according to claim 25, wherein data responsive to the formatted request includes at least one of private and non-private data.
 30. The method according to claim 25, wherein the data responsive to the formatted request may include at least one of individual records and group records.
 31. The method according to claim 25, wherein the step of creating summary data includes removing all source identifier information.
 32. The method according to claim 25, wherein the step of the at least one data location retrieving data includes: a. forwarding the formatted request to at least one secondary data location; b. creating summary data at the at least one secondary data location; and c. transmitting the summary data to the at least one data location.
 33. The method according to claim 25, wherein the step of the at least one data location retrieving data responsive to the formatted request includes multi-tiered data retrieval, analysis and aggregation.
 34. The method according to claim 27, further including the step of generating a bill for the requesting non-member based upon the requesting non-member's data request.
 35. The data collection method of claim 25, wherein the summary data is not encrypted.
 36. The data collection method of claim 25, wherein the summary data is not de-identified.
 37. A system for data collection and processing, comprising: a. an exchange hub; b. a requesting entity communicatively coupled to the exchange hub; and c. at least one data location communicatively coupled to the exchange hub, d. wherein the exchange hub receives a request for information from the requesting entity, processes the received request, forwards the request to the at least one data location, aggregates all data responsive to the request from the at least one data location so as to create an aggregated data result; and transmits the aggregated data result to the requesting entity; and e. wherein the aggregating data does not include individual records, private records and source identifiers.
 38. The system in accordance with claim 37, further including a data agent located at the at least one data location for retrieving data responsive to the request.
 39. The system according to claim 37, wherein the requesting entity is at least one of a requesting member and a requesting non-member.
 40. The system according to claim 37, wherein the exchange hub aggregates and analyzes information received from the at least one data location and data resident to the exchange hub.
 41. The system according to claim 37, wherein the aggregated and analyzed information does not include any source identifier information.
 42. The system according to claim 37, wherein the at least one data location stores at least one of private and non-private data.
 43. The system according to claim 37, wherein the exchange hub may send a further request for information based upon data received in response to the request for information.
 44. The system according to claim 37, wherein the at least one data location includes a multi-tiered group of secondary data locations containing responsive data.
 45. The system according to claim 37, wherein the exchange hub includes an analytic engine for processing requests for information received from a requesting entity, receiving data responsive to the requests for information, aggregating and analyzing the received data and forwarding the aggregated and analyzed data to the requesting entity.
 46. The system in accordance with claim 37, wherein the exchange hub includes a summary data module for storing the results of previous requests for aggregated and analyzed data.
 47. The system according to claim 37, wherein the exchange hub includes a central billing module for generating a billing statement for a requesting non-member.
 48. The system of claim 37, wherein the summary data is not encrypted.
 49. The system of claim 37, wherein the summary data is not de-identified.